Arena: A General Evaluation Platform and Building Toolkit for Multi-Agent Intelligence
Learning agents that can not only take tests but also innovate is becoming a
hot topic in AI. One of the most promising paths
towards this vision is multi-agent learning, where agents act as the
environment for each other, and improving each agent means proposing new
problems for others. However, existing evaluation platforms are either not
compatible with multi-agent settings, or limited to a specific game. That is,
there is not yet a general evaluation platform for research on multi-agent
intelligence. To this end, we introduce Arena, a general evaluation platform
for multi-agent intelligence with 35 games of diverse logics and
representations. Furthermore, multi-agent intelligence is still at the stage
where many problems remain unexplored. Therefore, we provide a building toolkit
for researchers to easily invent and build novel multi-agent problems from the
provided game set based on a GUI-configurable social tree and five basic
multi-agent reward schemes. Finally, we provide Python implementations of five
state-of-the-art deep multi-agent reinforcement learning baselines. Along with
the baseline implementations, we release a set of 100 best agents/teams,
trained with different training schemes for each game, as a basis for
evaluating agents by population performance. As such, the research community
can perform comparisons under a stable and uniform standard. All the
implementations and accompanying tutorials have been open-sourced for the
community at https://sites.google.com/view/arena-unity/
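
To make the "social tree plus reward scheme" idea more concrete, below is a toy Python sketch of how per-agent scores might be propagated through such a tree to produce training rewards. The scheme names ("collaborative", "competitive", "isolated") and the propagation rules are assumptions chosen for illustration, not Arena's actual implementation or API.

```python
# Toy sketch: propagate raw per-agent scores through a social tree whose
# internal nodes carry a reward scheme. Scheme names and rules are
# illustrative assumptions, not Arena's real interface.

def propagate(node, raw):
    """Return {agent_id: reward} for the subtree rooted at `node`.

    `node` is either an agent id (leaf) or a dict of the form
    {"scheme": "collaborative" | "competitive" | "isolated", "children": [...]}.
    `raw` maps agent ids to their raw per-agent scores.
    """
    if not isinstance(node, dict):                 # leaf: a single agent
        return {node: raw[node]}

    branches = [propagate(child, raw) for child in node["children"]]

    if node["scheme"] == "collaborative":
        # Everyone in the subtree shares the subtree's mean reward.
        merged = {k: v for b in branches for k, v in b.items()}
        mean = sum(merged.values()) / len(merged)
        return {k: mean for k in merged}

    if node["scheme"] == "competitive":
        # Zero-sum across sibling branches: each branch's reward is its own
        # total minus the average total of the other branches.
        totals = [sum(b.values()) for b in branches]
        out = {}
        for b, t in zip(branches, totals):
            others = (sum(totals) - t) / max(len(totals) - 1, 1)
            out.update({k: t - others for k in b})
        return out

    # Default ("isolated"): leave each agent's reward untouched.
    return {k: v for b in branches for k, v in b.items()}


# Example: two teams of two agents, collaborative within a team,
# competitive between teams.
tree = {"scheme": "competitive", "children": [
    {"scheme": "collaborative", "children": ["a1", "a2"]},
    {"scheme": "collaborative", "children": ["b1", "b2"]},
]}
print(propagate(tree, {"a1": 1.0, "a2": 3.0, "b1": 0.0, "b2": 2.0}))
# -> {'a1': 2.0, 'a2': 2.0, 'b1': -2.0, 'b2': -2.0} (zero-sum across teams)
```
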
The Costly Dilemma: Generalization, Evaluation and Cost-Optimal Deployment of Large Language Models
When deploying machine learning models in production for any
product/application, there are three properties that are commonly desired.
First, the models should be generalizable, so that we can extend them to further
use cases as our knowledge of the domain develops. Second, they should be
evaluable, so that there are clear metrics for performance and computing
those metrics in production settings is feasible. Finally, the deployment
should be as cost-optimal as possible. In this paper we propose that these
three objectives (i.e. generalization, evaluation and cost-optimality) can
often be relatively orthogonal and that for large language models, despite
their superior performance over conventional NLP models, enterprises need to
carefully assess all three factors before making substantial investments in this
technology. We propose a framework for generalization, evaluation and
cost-modeling specifically tailored to large language models, offering insights
into the intricacies of development, deployment and management of these models.
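
As a rough illustration of the kind of cost-modeling such a framework has to capture, the sketch below estimates monthly spend for token-priced LLM inference. The prices, request volumes, and token counts are made-up placeholders, not figures from the paper.

```python
# Back-of-the-envelope LLM inference cost model. All numbers below are
# hypothetical placeholders; substitute real per-token prices and traffic
# estimates for an actual analysis.

def monthly_inference_cost(requests_per_month,
                           avg_input_tokens,
                           avg_output_tokens,
                           price_per_1k_input,
                           price_per_1k_output):
    """Estimated monthly spend for a token-priced hosted LLM."""
    input_cost = requests_per_month * avg_input_tokens / 1000 * price_per_1k_input
    output_cost = requests_per_month * avg_output_tokens / 1000 * price_per_1k_output
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical workload: 2M requests/month, 500 input + 150 output tokens each.
    cost = monthly_inference_cost(
        requests_per_month=2_000_000,
        avg_input_tokens=500,
        avg_output_tokens=150,
        price_per_1k_input=0.0005,    # $ per 1k input tokens (placeholder)
        price_per_1k_output=0.0015,   # $ per 1k output tokens (placeholder)
    )
    print(f"Estimated monthly inference cost: ${cost:,.0f}")
```

A model like this makes the paper's "cost-optimality" axis concrete: changing the average prompt length, caching repeated queries, or moving to a smaller fine-tuned model all show up directly as changes to these inputs.
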